Main feature of interest
The main feature of interest in the dataset is quality. I want to find out which attributes affect the perceived quality.
As a wine lover, I would like to analyse the ‘White Wine Quality’ dataset and understand which attributes really contribute to wine quality. I am not a wine expert, so I am not sure exactly what to look for. However, I hope the dataset will give me a rough idea of what to look for the next time I buy wine.
The dataset contains 4898 observations with 13 variables.
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
Before starting, I will check if there are any missing values in the dataset.
## [1] 0
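The R call behind this output is not shown; a call such as `sum(is.na(...))` would print a single count like this, but that is an assumption. As a minimal sketch of the same check in Python (not the language of the original analysis), the function below counts empty cells. The name `count_missing` and the three sample rows, taken from the structure listing above, are only for illustration.

```python
import math


def count_missing(rows):
    """Count cells that are None or NaN across a list of row dictionaries."""
    missing = 0
    for row in rows:
        for value in row.values():
            if value is None or (isinstance(value, float) and math.isnan(value)):
                missing += 1
    return missing


# First three observations from the structure listing (a few columns only).
wines = [
    {"fixed.acidity": 7.0, "alcohol": 8.8, "quality": 6},
    {"fixed.acidity": 6.3, "alcohol": 9.5, "quality": 6},
    {"fixed.acidity": 8.1, "alcohol": 10.1, "quality": 6},
]

print(count_missing(wines))
```

A result of 0 means every cell holds a value, so no rows need to be dropped or imputed.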
The wine quality scores range from 3 to 9. None of the 4898 observations received a score of 1, 2 or 10. The average quality is 5.878.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.878 6.000 9.000
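The six figures in this summary are the minimum, first quartile, median, mean, third quartile and maximum. The sketch below computes the same six numbers in Python (not the language of the original analysis). It assumes that the `inclusive` method of `statistics.quantiles` is the right counterpart to R's default quantile rule, and the helper name `six_number_summary` is made up for this example. Because the first ten quality scores in the structure listing are all 6, the sketch uses the first ten alcohol values instead, so its output will not match the figures above.

```python
import statistics


def six_number_summary(values):
    """Return min, quartiles, mean and max, similar to R's summary()."""
    # method="inclusive" interpolates between data points and treats the
    # sample minimum and maximum as the 0th and 100th percentiles.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "mean": statistics.fmean(values),
        "q3": q3,
        "max": max(values),
    }


# First ten alcohol values from the structure listing.
alcohol = [8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11]

for name, value in six_number_summary(alcohol).items():
    print(f"{name}: {value:.3f}")
```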
The bulk of the wines received a score between 5 and 7.
The alcohol percentage of the wines ranges from 8.0 to 14.2. The average is 10.51.
The pH values are approximately normally distributed and range from 2.72 to 3.82. The mean is 3.188.
Residual sugar determines the sweetness of the wines. It ranges from 0.6 to 65.8 g per liter; however, the majority of wines have between 0.6 and 20 g per liter.
The density ranges from 0.9871 to 1.0390. However, the bulk of the wines have a density below 1.0025. The mean density for the dataset is 0.994.
The fixed acidity ranges from 3.8 to 14.2. The mean fixed acidity is 6.855.
Volatile acidity ranges from 0.08 to 1.1, with a mean of 0.2782.
Some of the wines have no citric acid at all. The highest citric acid value is 1.66 and the average is 0.3342.
The minimum chloride value is 0.0090 and the maximum is 0.346. The mean is 0.04577.
The minimum free sulfur dioxide value is 2 and the maximum is 298. However, the bulk of the wines lie between 23 and 46.
The total sulfur dioxide is roughly normally distributed, starting at 9 and reaching up to 440. The average value is 138.4.
Sulphates have a bimodal distribution. The lowest value is 0.220 and the highest is 1.08.
The dataset holds information about 4898 white wines. For each wine, there are 11 objective attributes as well as an overall quality evaluation, which is the median of the evaluations given by at least 3 wine experts. All of the attributes are either integer or numeric values.
I am not a wine expert, so I cannot say for sure which features directly affect the quality of the wines. However, I believe alcohol percentage, pH level, residual sugar level and density might have a direct effect on the quality score. Nevertheless, it is a good idea to keep an eye on the other attributes as well.
I found the distribution of alcohol percentages unusual. I always thought that almost all wines have 8-10% alcohol and that anything above is rather rare. However, this is not the case. The alcohol distribution is positively skewed, but there are also many wines above 12%.
Luckily, the dataset does not have any NA values. I only removed the column called ‘X’, as it was only there to number the wines, which I will not need.
Let’s start by using ggpairs to investigate all bivariate relationships:
The features that I mentioned above have the following correlations to quality:
Alcohol percentage: 0.436
pH level: 0.0994
Residual sugar: -0.0976
Density: -0.307
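These values are Pearson correlation coefficients: the covariance of two variables divided by the product of their standard deviations, which always lies between -1 and 1. As a minimal sketch of that calculation in Python (not the language of the original analysis), the function below computes it by hand. The name `pearson` is made up for this example, and the inputs are only the first ten residual sugar and alcohol values from the structure listing, so the printed number will not reproduce any of the full-dataset coefficients quoted in this analysis.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of cross products and of squared deviations from the means.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cross / math.sqrt(ss_x * ss_y)


# First ten rows from the structure listing.
residual_sugar = [20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5]
alcohol = [8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11]

print(round(pearson(residual_sugar, alcohol), 3))
```

The sign gives the direction of the relationship and the magnitude its strength, which is why alcohol (0.436) and density (-0.307) stand out against pH and residual sugar.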
Of those, only density and alcohol percentage have correlations that are somewhat high. Let’s investigate them more closely with boxplots and scatterplots:
There are several interesting things to see here. As mentioned above, there is a correlation of 0.436 between alcohol and quality. This can be seen in the boxplot, where, for quality scores from 5 to 9, the alcohol percentage rises with every step up in quality. Similarly, the scatter plot shows an upward trend between the two variables. This means that judges tend to give higher quality scores to wines with more than 10% alcohol.
Density and quality also correlate (-0.307); however, unlike alcohol and quality, these two variables have a negative relationship. Again, the boxplot shows that density decreases steadily as quality increases from 5 to 9, and the scatter plot shows a negative trend between the two. Interestingly, for both alcohol and density, the quality scores from 5 to 9 show clear trends, either increasing or decreasing. However, the quality scores of 3 and 4 are not in line with these trends. This probably means that some low-quality wines have different attributes, but for the majority of wines a higher alcohol percentage and a lower density are indicators of a higher quality score.
I also want to check the relationships between the other variables. The two strongest correlations here are between density and residual sugar, and between density and alcohol.
The strongest correlation in the dataset is between density and residual sugar with 0.839:
The second strongest correlation in the dataset is between density and alcohol with -0.78:
I wanted to find out which features directly relate to the quality variable. For this purpose, I investigated alcohol percentage, pH level, residual sugar and density. I discovered that none of these features correlates strongly with quality. Among them, alcohol percentage had the strongest relationship, with a correlation of 0.436. I also discovered that, among the better quality wines, a higher percentage of alcohol usually came with better quality evaluations. The second strongest correlation was with density, at -0.307. Here, I saw the opposite trend: among good quality wines, the less dense ones scored better overall. To my surprise, the pH level and residual sugar had only a small impact on quality.
The two strongest relationships that I discovered were between density and residual sugar, and between density and alcohol. As expected, more residual sugar meant a denser wine and, conversely, more alcohol meant a less dense wine.
The strongest relationship is between density and residual sugar with 0.839.
First, I want to investigate the relationships between alcohol, density and quality a bit more closely.
Here, I plotted alcohol vs. density and colored the points by quality score. The plot shows the negative relationship between alcohol and density: as the alcohol percentage increases, density decreases. Moreover, moving towards the top left, we get closer to the sweet spot: more and more wines receive good scores as alcohol rises and density falls.
Next, I would like to investigate the relationship between density, residual sugar and alcohol. For this purpose, I will first create buckets for the alcohol percentages.
##
## (7,9] (9,10] (10,11] (11,12] (12,13] (13,15]
## 502 1583 1252 850 609 102
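The bucket labels above are half-open intervals that exclude the lower bound and include the upper bound, so a wine with exactly 9% alcohol falls into (7,9] and not into (9,10]. This is the kind of output R's `cut()` produces, although the actual call is not shown here. The following minimal sketch reproduces the bucketing rule in Python (not the language of the original analysis). The breaks are read off the labels above, the name `alcohol_bucket` is made up for this example, and the input is only the first ten alcohol values from the structure listing.

```python
from bisect import bisect_left
from collections import Counter

# Bucket edges read off the labels in the table above.
BREAKS = [7, 9, 10, 11, 12, 13, 15]


def alcohol_bucket(value, breaks=BREAKS):
    """Return the right-closed interval label containing value, or None."""
    # bisect_left finds the first edge >= value, so a value equal to an
    # edge is assigned to the interval that ends at that edge.
    idx = bisect_left(breaks, value) - 1
    if idx < 0 or idx >= len(breaks) - 1:
        return None  # at or below the lowest edge, or above the highest
    return f"({breaks[idx]},{breaks[idx + 1]}]"


# First ten alcohol values from the structure listing.
alcohol = [8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11]

counts = Counter(alcohol_bucket(a) for a in alcohol)
for label, count in sorted(counts.items(), key=lambda kv: BREAKS.index(int(kv[0][1:kv[0].index(",")]))):
    print(label, count)
```

Uneven bucket widths, such as the wider first and last intervals here, are a reasonable way to keep the sparse tails of the distribution from producing nearly empty groups.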
This plot shows an interesting relationship between these 3 variables. As the wines become sweeter, the density increases. On the other hand, for the same sweetness level, more alcohol makes the wines less dense.
The most interesting finding here was the relationship between alcohol, density and quality. I discovered that less dense wines with a higher alcohol percentage tend to get better scores.
The relationship between density, residual sugar and alcohol was also really interesting. The sweeter the wine, the denser it is. On the other hand, for the same residual sugar level, a higher alcohol percentage meant a lower density.
As I found out in my investigation, alcohol has the strongest correlation with the quality scores. However, the relationship is not perfectly linear. To show this relationship best, I used boxplots for each quality score and marked the mean for each score. The alcohol percentage initially goes down; however, in the better quality range (5 - 9), it increases continuously.
Besides having the second strongest relationship with quality, density is also part of the strongest correlation among all of the features. With a correlation of 0.839 to residual sugar, the relationship is strong and roughly linear, which I tried to capture in the chart above.
After identifying the two features that correlate most strongly with quality, alcohol and density, I plotted all three in a multivariate plot using quality as the color. As can be seen, quality increases rapidly towards the upper left portion (more alcohol, less dense).
Throughout this investigation, I discovered many cool things about white wines. I found that the quality scores did not rely on a single feature but probably depended on a mixture of many. I discovered that the strongest influence on quality in this dataset was alcohol, followed by density. Overall, the judges gave better scores to less dense wines with a higher alcohol percentage. I also discovered that the relationship was not linear for either feature (hence the modest correlation coefficients), although for the higher scores both showed clearer trends.
One of the main advantages of the dataset was that there was no missing data. The data was well structured, so I could start the investigation right away. The dataset also had nearly 5000 wine evaluations, which usually provided enough samples for the multivariate analysis. One of my main struggles was the lack of any single strong influence on quality. Although there were many features in the dataset, they were mostly weakly correlated with the score.
As future work, I believe it would definitely make sense to expand the features to cover other attributes such as grape type, the region where the wine is produced and the year of production. Currently, most of the features in this dataset are chemical in nature, and they would probably offer far more insight in combination with non-chemical features.